The substrate recognition regions in cytochrome
P450 family 2 (CYP2) proteins were inferred by
group-to-group alignment of CYP2 sequences and
those of bacterial P450s, including Pseudomonas putida
P450 101A (P450cam), whose substrate-binding
residues have been definitely identified by x-ray crystallography
of a substrate-bound form (Poulos, T. L.,
Finzel, B. C., and Howard, A. J. (1987) J. Mol. Biol.
195, 687-700). The six putative substrate recognition
sites, SRSs, thus identified are dispersed
along the primary structure and constitute about 16%
of the total residues. All the reported point mutations
and chimeric fragments that significantly affect the
substrate specificities of the parental CYP2 enzymes
fell within or overlapped some of the six SRSs. Analysis
of nucleotide substitution patterns in closely related
members in four subfamilies, CYP2A, 2B, 2C, and 2D,
consistently indicated that the SRSs have accumulated
more nonsynonymous (amino acid-changing) substitutions
than the rest of the sequence. This observation
supports the idea that diversification of duplicate genes
of drug-metabolizing P450s occurs primarily in substrate
recognition regions to cope with an increasing
number of foreign compounds.



The cytochrome P450 (P450) monooxygenase system plays
a central role in the metabolism of a wide variety of foreign 
compounds such as plant metabolites, environmental pollu- 
tants, and drugs. The human and rodent genomes contain at 
least 50 P450 genes, which are classified into 10 families 
according to the currently available protein sequence data 
(Nebert et al., 1991). Enzymes belonging to families 1 through
4 are mainly concerned with catabolism of thousands of 
endogenous and exogenous chemicals, whereas members of 
other mammalian P450 families catalyze specific reactions 
involved in physiologically important pathways of synthesis 
of steroid hormones, prostaglandins, and vitamin D3 (Gonzalez,
1990). The exact number of active P450 genes in a
mammalian genome which encode drug-metabolizing P450s 
is not known but may not greatly exceed 50. These P450
enzymes show partially overlapping but distinct substrate specificities.
Thus it is of great interest to elucidate the molecular
mechanisms underlying the broad but specific metabolic ca- 
pacities of the mammalian P450 systems consisting of rela- 
tively limited numbers of catalysts. 

A primary question is which parts of a P450 protein are 
involved in recognition or binding of substrates and hence 
determine the substrate specificity. There have been various 
experimental and analytical investigations on this problem, 
including chemical modifications with substrate analogues 
(Onoda et al., 1987), site-directed mutagenesis (Imai and
Nakamura, 1989; Aoyama et al., 1989; Lindberg and Negishi,
1989; Matsunaga et al., 1990b; Zhou et al., 1991), protein
engineering with chimeric constructs (Sakaki et al., 1987;
Imai, 1988; Pompon and Nicolas, 1989; Uno and Imai, 1989;
Kronbach et al., 1989; Uno et al., 1990), searching for similar
subsequences in other protein families with similar substrate
or binding specificities (Gotoh et al., 1985; Picado-Leonard
and Miller, 1988), and sequence alignment (Gotoh and Fujii-Kuriyama,
1989; Laughton et al., 1990) with P450 101A, the
sole P450 whose substrate-binding residues have been definitely
identified from the three-dimensional structure (Poulos
et al., 1985, 1987). The results of these studies have been
controversial, and no unified view about the substrate recog- 
nition sites in mammalian P450s has been established. 

One reason for this confusion arises from the difficulty in 
aligning distantly related protein sequences of a mammalian 
P450 and bacterial P450 101A. In fact, a mammalian P450
and P450 101A show only 12-20% identity of amino acids
(Nelson and Strobel, 1987; Gotoh and Fujii-Kuriyama, 1989), 
and the typical alignment scores of 3.0-5.0 S.D. (Gotoh et al., 
1983; Gotoh and Fujii-Kuriyama, 1989) indicate that only 
marginal accuracy can be expected in pairwise alignments 
(Barton and Sternberg, 1987). However, a large number of 
mammalian P450 sequences are now available, and those of 
members of a given family are easily aligned. Several bacterial 
P450s with similar sequences to P450 101A have also been
reported (Nebert et al., 1991). Hence, it is possible to improve
the accuracy of alignment between mammalian and bacterial 
sequences by a group-to-group comparison. Further improve- 
ment in the accuracy should be achieved by taking into 
consideration not only primary structures but also some properties
associated with higher order structures such as secondary
structure prediction (Garnier et al., 1978; Gibrat et al.,
1987) and hydropathy indices (Kyte and Doolittle, 1982). 
Taking these properties into consideration, I have modified 
our basic alignment algorithms (Gotoh, 1982, 1990). This 
paper presents potential substrate recognition sites in mam- 
malian P450s identified using an amended alignment. 

The versatility of the protective cytochrome P450 systems 
against foreign compounds is reminiscent of that of globulin
families, such as immunoglobulin, T-cell receptor, and major
histocompatibility complex. On analysis of nucleotide substitution
patterns in highly polymorphic major histocompatibility
complex class I and class II genes, Hughes and Nei (1988,
1989) found that nonsynonymous (amino acid-replacing) co- 
don changes occur at high rates within the antigen recognition 
site (Bjorkman et al., 1987). A similar concentration of non- 
synonymous codon changes has also been recognized in re- 
gions corresponding to antigen-binding sites in immunoglob- 
ulin genes (Tanaka and Nei, 1989). These observations are 
interpreted as a result of positive Darwinian evolution 
(Hughes and Nei, 1988, 1989). Since the diversification of 
drug-metabolizing P450 genes in animals seems to have been 
promoted through adaptation to toxic materials in foods 
(Krieger et al., 1971; Gonzalez and Nebert, 1990), the consequences
of adaptive evolution might be traced in the nucleotide
sequences of P450 genes as in major histocompatibility
complex or immunoglobulin genes. 

Keeping this possibility in mind, I analyzed the coding 
nucleotide sequences of P450 family 2 (CYP2) genes in detail 
and found that the regions encoded by codons with excessively 
high nonsynonymous substitutions coincided well with the 
potential substrate-binding sites inferred from the refined 
alignment between CYP2 and bacterial P450 protein sequences.
The CYP2 family was chosen for study because it
consists of many related genes that are suitable for analysis 
of nucleotide substitution patterns and because there have 
been many experimental studies on members of this family. I 
show here that all the known point mutations and chimeric 
fragments that alter substrate specificities of some members 
of the CYP2 family are located within or very close to the 
putative substrate recognition sites identified by comparative 
sequence analysis methods. 

MATERIALS AND METHODS 

Nucleotide and Amino Acid Sequence Data— All coding nucleotide
sequences were taken directly from the literature and cross-checked 
with the data in GenBank Rel. 66 when possible. Two sequences were 
regarded as polymorphic when their derived amino acid sequences 
differed by less than 1%, and the sequence that was complete or 
closest to the consensus sequence was selected. Thus, 51 CYP2 (2A1-7,
2B1-5, 7, 9, 10, 13 with two entries for each of 2B4 and 2B5, 2C1-9,
11-14, 16, 18, 19, 21, 22, 2D1-6, 9, 10, three mammalian 2E1, 2F1,
2G1, 2H1, and 2H2), 12 CYP1 (five mammalian 1A1, six mammalian
1A2, and one fish 1A1), eight CYP3A (3A1-8), and eight bacterial
(101A, 103A, 104A, 105A-C, 106A, and 55A) sequences were used for
protein sequence analysis. Fusarium oxysporum P450 55A was included
in the bacterial group because this gene was probably integrated
into the fungal genome via a recent horizontal gene transfer
event from a bacterium related to Streptomyces (Kizawa et al., 1991).
The codon substitution patterns of the sequences of CYP2A, 2B, 2C, 
and 2D subfamily members were analyzed. 

Alignment of Groups of Protein Sequences— CYP1, 2, and 3 and
bacterial amino acid sequences were aligned within each group by the 
methods described previously (Gotoh and Fujii-Kuriyama, 1989; Go- 
toh, 1990). The sequences of two groups were aligned using a few 
modifications of the standard dynamic programming algorithm with 
linear gap weights (Gotoh, 1982). First, a column in the original 
alignment was converted asymmetrically to either the "frequency"
or "profile" vector, as described by Gribskov et al. (1987). In calculat- 
ing a frequency or profile, we used a set of weights to even the 
contributions of individual sequences (Gotoh and Fujii-Kuriyama, 
1989; Altschul et al., 1989). Second, we introduced four additional 
elements in a frequency or profile vector, i.e. helix-, sheet-, and coil-forming
propensities calculated by the method of Gibrat et al. (1987)
and a hydropathy index obtained with the parameters of Kyte and 
Doolittle (1982) with a window size of 9. A similarity score used in 
the alignment process was the scalar product of frequency and profile 
vectors. The contributions of the three parts, i.e. primary structure, 
secondary structure, and hydropathy, may be changed with normalizing
factors. In the present analysis, we adjusted these factors to
0.5:0.25:0.25, where each factor stands for the portion of a score
coming from the corresponding part for two identical random sequences
with the average amino acid compositions. The value for
each element is an average of those computed for individual se- 
quences, in which we used the same weights as those in calculating 
the profiles. 
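
The column-to-column score described above can be illustrated with a minimal sketch. It assumes that the three parts (primary structure, secondary structure, hydropathy) are simply combined with fixed multipliers of 0.5, 0.25, and 0.25; the normalization against identical random sequences used in the text is not reproduced here, and the function and argument names are hypothetical.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Scalar product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def column_score(
    freq_a: Sequence[float],     # residue frequencies in a column of group A
    profile_b: Sequence[float],  # profile scores in a column of group B
    sec_a: Sequence[float],      # (helix, sheet, coil) propensities, group A
    sec_b: Sequence[float],      # (helix, sheet, coil) propensities, group B
    hyd_a: float,                # averaged hydropathy index, group A
    hyd_b: float,                # averaged hydropathy index, group B
    weights: Sequence[float] = (0.5, 0.25, 0.25),  # assumed plain multipliers
) -> float:
    """Weighted sum of the primary, secondary, and hydropathy contributions."""
    w_pri, w_sec, w_hyd = weights
    return (
        w_pri * dot(freq_a, profile_b)
        + w_sec * dot(sec_a, sec_b)
        + w_hyd * hyd_a * hyd_b
    )
```

In a dynamic programming alignment, a score of this kind would replace the usual residue-to-residue substitution score for each pair of columns.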

Analysis of Nucleotide Substitution Patterns— The coding nucleotide
sequences of CYP2A, 2B, 2C, and 2D members were aligned
within each subfamily according to protein sequence alignments. The
few gaps found in the alignments were ignored in the following
analysis. The numbers of synonymous, N_s(A, B), and nonsynonymous,
N_n(A, B), sites and the numbers of synonymous, n_s(A, B), and
nonsynonymous, n_n(A, B), substitutions between a pair of sequences,
A and B, were estimated by the method of Nei and Gojobori (1986).
The fractions of synonymous and nonsynonymous substitutions are
defined as p_s = n_s(A, B)/N_s(A, B) and p_n = n_n(A, B)/N_n(A, B),
respectively. Let n_si(A, B) and n_ni(A, B) be the numbers of synonymous
and nonsynonymous substitutions between sequences A and B
within a window of w codons centered at the ith codon
position (w/2 ≤ i ≤ L - w/2, where L is the total number of codons
in A or B). R_si(A, B) = n_si(A, B)·L/(n_s(A, B)·w) and R_ni(A, B) =
n_ni(A, B)·L/(n_n(A, B)·w) indicate local variations in synonymous and
nonsynonymous substitutions normalized with the expected numbers
based on the global values. Assuming that R_si and R_ni are intrinsic to
the codon position i, regardless of the sequence pairs considered, we
estimated the most plausible value for R_si or R_ni by linear regression
analysis, i.e. R_si (or R_ni) was estimated from the regression coefficient
of n_si (or n_ni) on n_s·w/L (or n_n·w/L) for various pairs of sequences.
Since the substitution counts tend to become saturated as the sequence divergence
increases, we considered only pairs for which (n_s + n_n)/L ≤ 0.17, i.e.
overall nucleotide sequence divergences are less than 17%. This
eliminated all interspecies comparisons, except a few between rodent
sequences. We also imposed weights on individual pairs, obtained
according to Model I of Altschul et al. (1989), to correct for otherwise
excessive contributions of phylogenetically differentiated pairs.
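
The windowed regression described above can be sketched as follows. This is an illustration under stated assumptions rather than the original program: the regression is taken through the origin with weighted least squares, the window size w is required to be odd, positions are 0-based, and the input is a hypothetical list of per-codon substitution counts for each sequence pair.

```python
import math
from typing import Dict, List, Optional, Sequence


def local_ratios(
    per_codon_counts: Sequence[Sequence[float]],  # one count list per sequence pair
    w: int,                                       # window size in codons (odd)
    weights: Optional[Sequence[float]] = None,    # one weight per sequence pair
) -> Dict[int, float]:
    """Estimate R_i as the slope of observed window counts on n * w / L."""
    if w % 2 == 0:
        raise ValueError("this sketch assumes an odd window size")
    length = len(per_codon_counts[0])
    if any(len(c) != length for c in per_codon_counts):
        raise ValueError("all pairs must cover the same number of codons")
    if weights is None:
        weights = [1.0] * len(per_codon_counts)
    half = w // 2
    totals = [sum(c) for c in per_codon_counts]
    ratios: Dict[int, float] = {}
    for i in range(half, length - half):
        num = den = 0.0
        for counts, total, wt in zip(per_codon_counts, totals, weights):
            expected = total * w / length               # n * w / L
            observed = sum(counts[i - half:i + half + 1])  # n_i in the window
            num += wt * expected * observed
            den += wt * expected * expected
        # Slope of a weighted least-squares line through the origin.
        ratios[i] = num / den if den > 0 else math.nan
    return ratios


def delta_r(
    nonsyn: Sequence[Sequence[float]],
    syn: Sequence[Sequence[float]],
    w: int,
    weights: Optional[Sequence[float]] = None,
) -> Dict[int, float]:
    """Difference R_ni - R_si at every position with a complete window."""
    r_n = local_ratios(nonsyn, w, weights)
    r_s = local_ratios(syn, w, weights)
    return {i: r_n[i] - r_s[i] for i in r_n}
```

With a single pair, the slope reduces to the plain ratio of observed to expected counts, so a window containing all substitutions of a pair gives R_i = L/w.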

RESULTS 

Structural Conservation between CYP2 and Bacterial 
P450s— Fifty-one CYP2 sequences and eight bacterial sequences
were aligned with each other by the group-to-group
alignment method as described under "Materials and Meth- 
ods." The result is illustrated in Fig. 1, in which the profiles 
of secondary structure information (a-c) and hydropathy 
indices (d) averaged over CYP2 proteins (upper profile in each
panel) are juxtaposed with those for bacterial P450s
(lower profile). A horizontal flat line in a profile indicates the
location of a gap (a span of deleted residues) introduced in 
the alignment process. The longest gaps are located between 
Helices F and G and between Helices J and K. Except for 
these gaps and about 30 positions at the N terminus, the 
CYP2 profiles clearly resemble the bacterial profiles. The 
dissimilarity in the N-terminal region is reasonable because 
this region is decisive for the difference in subcellular local- 
izations of membrane-bound microsomal P450s and soluble 
bacterial P450s (Sakaguchi et al., 1987; Vergeres et al., 1989).

To evaluate the alignment more quantitatively, I examined 
how the secondary structure or hydropathy index at each site 
in CYP2 correlates with that for bacterial P450s at the same 
alignment position. As references for this comparison, several 
pairs of P450 groups with various degrees of sequence diver- 
gence were examined similarly, and results are summarized 
in Fig. 2. All but one of the correlation coefficients exceeded 
0.5. Even the weakest correlation between the β-structure
indices of CYP2 and bacterial P450s (r(sheet) = 0.42) is highly 
significant (t = 9.4, p < 10"'"). The correlation coefficients 
gradually decline as sequence divergence increases. For all 
comparisons, a definite order was observed among the four 
correlation coefficients: i.e. r(hydropathy) > r(coil) > r(helix)
> r(sheet). The absence of abnormality in the CYP2-bacterial 
comparisons suggests that the quality of the alignment is 
reasonably good and is similar to those between mammalian 
P450 families. 
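
The comparison of profiles at aligned positions can be sketched as a Pearson correlation followed by the usual t statistic, t = r·sqrt((n - 2)/(1 - r²)). This is only an assumed reading of the test reported above: the number of aligned positions entering each correlation is not stated in the text, and gap positions are represented here by a hypothetical None value and skipped.

```python
import math
from typing import Optional, Sequence, Tuple


def pearson_r(
    xs: Sequence[Optional[float]],
    ys: Sequence[Optional[float]],
) -> Tuple[float, int]:
    """Pearson correlation over positions where both profiles have a value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three aligned positions")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n


def t_statistic(r: float, n: int) -> float:
    """t value for testing r against zero with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```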
Fig. 1. Predicted secondary structure information and hydropathy
index calculated for collective CYP2 sequences and
bacterial sequences. Helix- (a), sheet- (b), and coil- (c) forming
propensities determined by the method of Gibrat et al. (1987). d,
hydropathy profiles obtained with the method of Kyte and Doolittle
(1982). In each panel, the upper profile for CYP2 sequences and the
lower profile for bacterial sequences are juxtaposed according to the
optimal alignment between the two groups of sequences. The hatched
and cross-hatched areas indicate the locations of α-helices and
β-structures, respectively, in P450 101A (Poulos et al., 1987). The labels
above the panels are according to Poulos et al. (1987), except for the
Cys ligand loop, which is labeled 6.




Fig. 2. Correlation coefficients of secondary structure information
and hydropathy indices between various groups of
P450 sequences. Correlation coefficients were calculated from the
values for helix (•), sheet (O), or coil (A) information and the
hydropathy index (x) at the same positions in alignments between
various pairs of P450 groups. I, II, and III, the CYP1, 2, and 3
families, respectively; AB, combined sequences of subfamilies 2A and
2B; C, the 2C subfamily; D, the 2D subfamily; D, the CYP2 sequences
other than those of 2D; Be, bacterial sequences closely related to
101A. The abscissa indicates the average percentage of amino acid
identities between the two sequence groups compared.

The hatched areas in Fig. 1 indicate the positions corresponding
to the 13 helices in P450 101A (Poulos et al., 1987),
and the cross-hatched areas indicate those corresponding to
β-structures. The labels at the top of the figure for the α-helical
and β-sheet regions are according to the notation of Poulos et
al. (1987), except the Cys ligand loop, which is labeled 6. Most
of the tentative helical regions in both CYP2 and bacterial
sequences have strong helix-forming propensities. Of 210 sites
at which P450 101A residues have a helical conformation, 141
(67%), 150 (71%), and 146 (70%) sites are predicted to be α-helical
with the original P450 101A sequence, collective bacterial
sequences, and collective CYP2 sequences, respectively,
when the decision constants in each case were adjusted so
that the predicted total contents of helices and sheets fit those
of the three-dimensional structure of P450 101A. Edwards et
al. (1989) reported a similar degree of predictivity of α-helices
using an older version of the method of Garnier et al. (1978).
The accuracies of prediction for β-structures were worse, being
44%, 40%, and 35% for the three sets of sequences. This poor
prediction is probably related to the facts that β-structures
are rare in P450 proteins and that β-structure information was
least conserved in different groups of P450s (Fig. 2). The
cumulative accuracies for the three states (helix, β-structure,
and coil) were 59%, 60%, and 56%, respectively, for the three
sequence sets. These values are close to the mean (58%) for
the whole data base reported by Gibrat et al. (1987). The most
important finding here was that nearly the same degrees of
prediction were attained with CYP2 sequences and with the
original P450 101A sequence, although on the average these
sequences share only 13% of identical amino acids. This fact
strongly supports the idea that the three-dimensional structure
of P450 101A is basically retained in CYP2 proteins, as
suggested in previous reports (Fujii-Kuriyama et al., 1987; 
Gotoh and Fujii-Kuriyama, 1989; Nelson and Strobel, 1989). 

Alignment-based Prediction of Substrate Recognition Sites 
in CYP2 Proteins — X-ray crystallography of substrate-bound 
forms of P450 101A (Poulos et al., 1985, 1987) showed that
the substrate camphor interacts with protein residues dispersed
in several separate loci in the primary structure. Based
on the three-dimensional model, Laughton et al. (1990) listed
the residues in P450 101A that are within 10 Å of the bound
camphor molecule. Shaded areas in Fig. 3 indicate the locations
of these substrate-binding residues on our alignment
between CYP2 and bacterial P450s. The CYP2 members in
Fig. 3 are selected to represent all the gaps within the CYP2 
alignment, except those at the extreme N or C termini. 
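
A distance cutoff of the kind used to list substrate-contacting residues can be sketched as below. The criterion is an assumption: a residue is counted when any of its atoms lies within the cutoff of any ligand atom, which may differ from the criterion actually applied to the P450 101A structure. The data layout (residue label mapped to atom coordinates in Å) is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def residues_within(
    residue_atoms: Dict[str, Sequence[Coord]],  # residue label -> atom coordinates
    ligand_atoms: Sequence[Coord],              # coordinates of the bound ligand
    cutoff: float = 10.0,                       # distance threshold in angstroms
) -> List[str]:
    """Return labels of residues with any atom within `cutoff` of the ligand."""
    near = []
    for label, atoms in residue_atoms.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in ligand_atoms):
            near.append(label)
    return near
```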

The encircled residues in Fig. 3 are taken from mutant or 
chimeric cDNA clones of P450 2A4/2A5 (Lindberg and Negishi,
1989), P450 2B2 (Aoyama et al., 1989), P450 2C4/2C5
(Kronbach et al., 1989), and P450 2D1 (Matsunaga et al.,
1990b) that show markedly different substrate specificities
(rather than enzymatic activities) from the parental clones
upon expression. These residues form three clusters, all of
which lie within, or in the close vicinity of (not more than
three amino acid residues apart from), the alignment positions
corresponding to the substrate-binding sites of P450 101A
(Fig. 3). These encircled residues are not the minimal set
responsible for the altered substrate specificities. For example,
amino acid changes at only one or two of the three encircled
sites in the B'-C region might be enough for the different
catalytic activities of P450 2C4 and P450 2C5 (Kronbach et
al., 1989).

Imai and co-workers (Imai, 1988; Uno and Imai, 1989; Uno
et al., 1990) constructed several chimeras between P450 2C2
and 2C14 and found three separate regions that are likely to
interact with substrate molecules. The sites corresponding to
these three regions are indicated by solid lines beneath the
P450 2C4 sequence in Fig. 3. All three regions cover some of
the alignment positions corresponding to the P450 101A
substrate-binding residues. Although not shown in Fig. 3,
residues in the distal helix (Helix I) are known to be involved
in binding of substrates (Poulos et al., 1985, 1987; Imai and
Nakamura, 1988, 1989; Furuya et al., 1989a, 1989b; Zhou et
al., 1991). Thus, the substrate recognition sites in CYP2
proteins suggested from the alignment are fully compatible
with available experimental observations.

From all these observations, the following six separate
regions were tentatively assigned as substrate recognition
sites (SRSs) in CYP2 proteins: 1) B' and flanking areas (103-126),
2) the C-terminal end of Helix F (209-216), 3) the N-terminal
end of Helix G (248-255), 4) the N-terminal half of
Helix I (302-320), 5) the β3 area (375-385), and 6) a central
region of β5 (485-493), where the numbers in parentheses
refer to the position in the alignment shown in Fig. 3. These
regions correspond to the P450 101A substrate-binding sites
extended by three amino acid residues on both sides. Since
there are only a few residues between the first and second of
these regions, these two regions were fused and the combined
area was named SRS-1. SRSs account for 76-79 of the total
512 positions in the CYP2 alignment.
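
The bookkeeping behind these region boundaries can be sketched as follows. The helper below extends each binding position by a flank of three residues and merges intervals that overlap or touch; the exact rule used to fuse neighboring regions is not given in the text, so this merge rule is an assumption. The region list simply restates the six alignment ranges quoted above.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]

# Alignment ranges of the six SRSs as listed in the text (inclusive).
SRS_REGIONS: List[Interval] = [
    (103, 126), (209, 216), (248, 255), (302, 320), (375, 385), (485, 493),
]


def extend_and_merge(positions: Iterable[int], flank: int = 3) -> List[Interval]:
    """Extend each position by `flank` on both sides and merge the results."""
    merged: List[List[int]] = []
    for pos in sorted(positions):
        start, end = pos - flank, pos + flank
        # Assumed rule: merge when intervals overlap or are directly adjacent.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def total_length(regions: Iterable[Interval]) -> int:
    """Number of alignment positions covered by inclusive intervals."""
    return sum(end - start + 1 for start, end in regions)
```

Summing the six listed ranges gives 79 positions, the upper end of the 76-79 positions quoted for the 512-column alignment.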

Nucleotide Substitution Patterns in Members of CYP2 
Subfamilies— To confirm the assignment of substrate recog- 
nition sites in CYP2 proteins, I examined local sequence 
variability between duplicate members within a subfamily. 
The underlying hypothesis for this was that closely related 
drug-metabolizing P450s may show divergent substrate specificities
and so cooperate in metabolizing a wider range of
foreign compounds. If this is so, the amino acid sequences of
substrate recognition sites should be more variable than the
rest of the molecule. To locate variable regions, I examined 
coding nucleotide sequences rather than amino acid se- 
quences, since nucleotide sequences provide much informa- 
tion about the molecular evolutionary mechanisms operating 
in diversification of closely related genes. 

Fig. 4 plots ΔR_i = R_ni - R_si at each codon position calculated
for each of four CYP2 subfamilies. R_ni is the mean ratio of
the real to expected numbers of nonsynonymous substitutions
within a window centered at the codon position i, and R_si is a
similar value for synonymous substitutions (see "Materials
and Methods"). Thus if ΔR_i is positive, the region covered by
the window has accumulated a larger number of nonsynonymous
nucleotide substitutions than of synonymous substitutions,
whereas if ΔR_i is negative, amino acid changes within
the window are fewer than expected from the average nucleotide
substitution rates. A good correlation between the regions
with positive ΔR_i values and SRSs (Fig. 4, shaded areas) is
apparent for all subfamilies. χ² tests with 2 × 2 contingency
tables (Table I) confirmed the significant associations in all
cases: χ² = 39.0, 37.7, 49.2, and 28.9 (≫10.8, p < 0.001 with
Fig. 4. Local variations in difference between rates of nonsynonymous
nucleotide substitutions and synonymous substitutions.
Positive ΔR_i means that nucleotide substitutions replacing
amino acids within a window (nine codons) have occurred more
frequently than expected from the global number of nonsynonymous
substitutions after correction for local variation in nucleotide mutation
rates. Calculations were made individually for four CYP2
subfamilies, 2A (a), 2B (b), 2C (c), and 2D (d). Shaded areas indicate
SRSs. The locations of the residues identified experimentally as
responsible for substrate recognition are indicated by arrowheads.
The three filled boxes in c indicate the locations of the fragments that
affect substrate specificities in chimeras between rabbit P450 2C2
and 2C14. The potential helical and β-structure regions are indicated
above the panels by boxes. The same labels as in Figs. 1 and 3 are



1 degree of freedom) for 2A, 2B, 2C, and 2D, respectively. The
correlation is particularly good for the 2C subfamily, presumably
because this subfamily contains the largest number
of members and its evolutionary process accords well with our
hypothetical scheme. As an exception, the ΔR_i values in SRS-5
(β3 area) are not positive. Since this region is narrow and
is flanked by well conserved residues such as Glu and Arg in
Helix K and His or Arg in β3 (Gotoh and Fujii-Kuriyama,
1989), amino acid changes in SRS-5 might be restricted
compared with those in other SRSs.
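
The contingency test can be sketched with the standard formula for a 2 × 2 table with Yates continuity correction, χ² = N(|ad - bc| - N/2)² / (r1·r2·c1·c2). The function name is hypothetical; the counts in the usage note are those of Table I, and the formula reproduces the four χ² values quoted in the text.

```python
def yates_chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square for the 2 x 2 table [[a, b], [c, d]] with Yates correction."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    # The correction shrinks |ad - bc| by n/2, but never below zero.
    corrected = max(abs(a * d - b * c) - n / 2.0, 0.0)
    return n * corrected ** 2 / (row1 * row2 * col1 * col2)
```

For example, `yates_chi_square(57, 19, 150, 269)` uses the CYP2A counts (positive and negative ΔR_i within SRSs, then outside SRSs) and evaluates to about 39.0.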

Table II lists the fractions of synonymous nucleotide substitutions
per synonymous site (p_s) and fractions of nonsynonymous
substitutions per nonsynonymous site (p_n) between
intraspecies P450 genes calculated separately within
and outside SRSs. As expected, p_n/p_s values for SRSs are all
larger than those for other regions. The values outside SRSs
(p_n/p_s = 0.30 ± 0.08) are fairly constant in all pairs and are
similar to the values observed in various genes (Miyata et al.,
1980; Nei, 1987). (The unusually large value for rat 2D1 and
2D5 is probably due to the large scale gene conversion between
these genes (Matsunaga et al., 1990a) and hence is omitted
from the present statistics.) On the other hand, the p_n/p_s
values for SRSs vary extensively (p_n/p_s = 0.74 ± 0.35). Although
p_n > p_s for some pairs, the inequality is not significant
in any case.



DISCUSSION

We (Gotoh et al., 1983; Gotoh and Fujii-Kuriyama, 1989)
and others (Black and Coon, 1987; Kalb and Loper, 1988;
Nelson and Strobel, 1987, 1988; Edwards et al., 1989; Laughton
et al., 1990) have reported several versions of alignment
between bacterial and eukaryotic (including CYP2) P450 sequences.
These alignments are generally consistent in the C-terminal
half of the sequences (Helix I to the C terminus) in
which conserved elements common to all P450s are located
(for review, see Gotoh and Fujii-Kuriyama, 1989). In the N-terminal
half, the regions of Helices C, D, and G are relatively
well conserved, and even the earliest version (Gotoh et al.,
1983) between single sequences (2B1 versus 101A) is basically
the same as that shown in Fig. 3 in terms of alignment of
these regions. In contrast, the presumable substrate-binding
regions exhibit extensive sequence variations, and so it has
been hard to obtain unequivocal alignment of these regions.
However, we now have the following reasons for the global
correctness of our alignment (Fig. 3), including the potential
substrate recognition regions.

First, the profiles of secondary structure information and 
hydropathy indices obtained with bacterial and CYP2 sequences
match each other very well (Fig. 1). I was not the
first to include predicted secondary structure information into
alignment of P450 sequences. Edwards et al. (1989) mainly
used helix-forming propensities to align various families of 
P450 sequences. Their alignment differs from ours in several 
respects and particularly in the locations of Helices E and F. 
This is not surprising because the Helix E region shows 



Table I

Contingency tables for test of excess nonsynonymous substitutions in SRSs
The numbers of residues with positive (+) and negative (-) ΔR_i values (Fig. 4) within and outside SRSs are
counted separately. χ² values were obtained with Yates correction.

          CYP2A (χ² = 39.0)    CYP2B (χ² = 37.7)    CYP2C (χ² = 49.2)    CYP2D (χ² = 28.9)
ΔR_i        +     -   Total      +     -   Total      +     -   Total      +     -   Total
SRSs       57    19     76      61    15     76      61    15     76      59    20     79
Others    150   269    419     173   246    419     152   268    420     175   251    426
Total     207   288    495     234   261    495     213   283    496     234   271    505
Table II

Fractions of synonymous (p_s) and nonsynonymous (p_n) nucleotide
substitutions outside and inside SRSs



Genes 



Rat 

Rat 

Rabbit 

Rabbit 

Rabbit 

Rabbit 

Rabbit 

Rabbit 

Mouse 

Human 

Human 

Human 

Human 

Human 

Human 

Rat 

Rabbit 

Rabbit 

Rabbit 

Rabbit 

Rabbit 

Rabbit 

Rat 

Rat 



2A1 2A2 
2A4 2A5 
2B1 2B2 
2B1 2B3
2B2 2B3
2B4a 2B4b
2B4a 2B5a
2B4b 2B5a
2B4a 2B5b 
2B4b 2B5b 
2B5a 2B5b 
2B9 2B10 
2C8 2C9 
2C8 2C18 
2C9 2C18 
2C8 2C19 
2C9 2C19 
2C18 2C19 
2C12 2C13 
2C1 2C2 
2C1 2C14 
2C2 2C14 
2C4 2C5 
2C4 2C16 
2C5 2C16 
2D1 2D3 
2D1 2D5 
2D3 2D5 
2D9 2D10 



2.4 ± 0.4 



16.6 



1.1 



.6 ± 1.1 
1.3 ± 0.3
2.2 ± 0.4 
2.2 ± 0.4 
2.9 ± 0.5 
2.6 ± 0.5 
3.0 ± 0.5 



10.9 ± 1.0 
11.1 ± 1.0 
0.5 ± 0.2 
0.8 ± 0.3 



11.J 



±0.9 



12.7 ± 2.0 

2.8 ± 1.0 
6.1 ± 1.4 

29.0 ± 2.7 
29.2 ± 2.7 

3.9 ± 1.1 
5.8 ± 1.3 
4.8 ± 1.2 
6.8 ± 1.4 
5.5 ± 1.3 
5.8 ± 1.3 

23.0 ± 2.5 
27.6 ± 2.7 
27.2 ± 2.7 
23.9 ± 2.6 
28.9 ± 2.7 
9.0 ± 1.7 



23.7 



:2.5 



3.0 ± 0.6 

2.5 ± 0.5 
9.3 ± 0.9 

1.6 ± 0.4 



17.7 ± 2.3 
27.3 ± 2.6

5.3 ± 1.3 

12.8 ± 2.0 

11.8 ± 1.9 

24.0 ± 2.4 
1.5 ± 0.7 

23.1 ± 2.4 

12.9 ± 1.9 



0.6 ± 0.6 
2.8 ± 1.2 
3.4 ± 1.4 

2.8 ± 1.2 
3.4 ± 1.4 
3.4 ± 1.4 

16.8 ± 2.8 

20.9 ± 3.0 
23.9 ± 3.2 
19.2 ± 2.9 

21.5 ± 3.1 
5.2 ± 1.6 

20.9 ± 3.0 
21.9 ± 3.1 

17.6 ± 2.8 

18.0 ± 2.9 

19.1 ± 2.9 

6.9 ± 1.9 

11.7 ± 2.4 

11.1 ± 2.3 

24.8 ± 3.2 
4.0 ± 1.5 

27.2 ± 3.3
12.1 ± 2.5 



22.1 ± 5.6 
19.9 ± 5.3 

9.8 ± 4.0 
20.5 ± 5.5 
11.4 ± 4.2
17.4 ± 5.0 

17.7 ± 5.1 

47.8 ± 6.6 
1.6 ± 1.6 

45.2 ± 6.5 

7.9 ± 3.4 



" Overall nucleotide sequence divergence within coding regions. 



negative helix propensities in the CYP2 profile (Fig. 1a).
However, the low helix propensities in this region are reasonable,
because some CYP2 members, as well as P450 101A,
contain one or two prolines in the Helix E region. Our
alignment is supported by the nearly identical shapes in the other
two secondary structure profiles (Fig. 1, b and c) and especially
in the hydropathy profile (Fig. 1d). As shown in Fig. 2,
hydropathy and coil propensity are conserved better than the 
helix propensity in all P450 families examined. Hence the use 
of helix propensity alone is less reliable than our method that 
considers various types of information. 

The second point that supports our alignment is the scarcity
of gaps within the suggested helical or β-structure regions in
the alignments of all CYP2 members. It is noteworthy that
the sequences in Fig. 3 represent all the gaps within the CYP2
alignment, except those located at the two ends. CYP2D
members have 3 additional residues and CYP2E members
have 1 less residue in the Helix B' region than other CYP2
members. Helix B' was not detected in P450 101A by x-ray
crystallography at 2.6-Å resolution (Poulos et al., 1985) but
was seen in the 1.63-Å refined structure (Poulos et al., 1987).
It is not surprising that the regular secondary structure in
this region is not conserved in other P450 proteins, since this
B' region is one of the major substrate recognition sites (Figs.
3 and 4). The deletions found in the β2 region of some CYP2
members are likely to occur in the turn or loop that connects
opposite strands in the body of the antiparallel β-structure.
Besides the gaps in the B' and β2 regions, only two single-residue
deletions are found near an end of Helix G and of β5.
The gaps in Helices A and B in the 101A sequence were
introduced in the course of alignment of bacterial sequences,
indicating somewhat variable natures of these short helices.
Otherwise, the distributions of gaps in both 101A and CYP2
sequences are consistent with the general tendency for
deletion or insertion of residues to occur outside secondary
structures (Lesk et al., 1986).

The third and most important point is the fact that our 
alignment accords nearly perfectly with experimentally iden- 
tified substrate recognition sites in various CYP2 members. 
As shown in Fig. 3, all the known point mutations and 
chimeric fragments that significantly affect substrate speci- 
ficities are mapped closely to the alignment-based SRSs. The 
importance of SRS-1 (B'-C area) for binding substrates has 
been repeatedly noted (Kronbach et al., 1989; Uno and Imai, 
1989; Aoyama et al., 1989) and well explained by extrapolation 
from the three-dimensional structure of the camphor-bound 
form of P450 101A (Poulos et al., 1985, 1987). Similarly, an 
important role of a part of the distal helix (SRS-4) in binding 
of substrates has been well documented (Poulos et al., 1985, 
1987; Imai and Nakamura, 1988, 1989; Furuya et al., 1989a, 
1989b; Zhou et al., 1991). There are, however, few reports of 
structure-based interpretations of the other sites. For exam- 
ple, the structural basis of the critical amino acid residue 209 
that determines testosterone hydroxylase activity of mouse 
P450 2A4 or coumarin 7-hydroxylase activity of 2A5 (Lind- 
berg and Negishi, 1989) has been enigmatic (Iwasaki et al., 
1991). Now this residue is mapped in SRS-2 (F-G interhelical 
region) close to the C terminus of Helix F. The region span- 
ning residues 211-262, which is essential for rabbit P450 2C2 
(laurate ω-1 hydroxylase) to bind fatty acids (Imai, 1988), 
covers another SRS (SRS-3) located at the N terminus of 
Helix G. The flexibility of the F-G region of P450 101A, 
as monitored by temperature factors, is markedly reduced upon 
substrate binding (Poulos et al., 1986). Hence the region was 
thought to be primarily important for accommodation of 
substrate molecules (Gotoh and Fujii-Kuriyama, 1989), al- 
though the corresponding region in eukaryotic sequences was 
only roughly assigned in our earlier report. The F-G inter- 
helical region is the longest insertion in CYP2 sequences 
compared with bacterial ones. It is likely that this enlarged 
region facilitates accommodation of such large molecules as 
polycyclic hydrocarbons and steroids. A single amino acid 
substitution (Ile → Phe) in rat P450 2D1 resulted in de- 
creased catalytic activity toward bufuralol but not debriso- 
quine (Matsunaga et al., 1990b). This residue is mapped in 
SRS-5 (β3 area), where one of the 3 residues responsible for 
the altered substrate specificities of mouse 2A4 and 2A5 
(Lindberg and Negishi, 1989) is also located. Uno et al. (1990) 
noted that replacement of the C-terminal 28 residues of P450 
2C2 with those of 2C14 produced a new stereospecific activity 
toward testosterone, suggesting participation of this region in 
substrate recognition. This C-terminal area overlaps SRS-6 
within the β5-structure. All these observations not only con- 
firm our assignment of SRSs, but also afford structural and 
functional bases for our alignment. 

Our assignment of SRSs is further supported by the results 
of analyses of local variations in nucleotide substitution pat- 
terns. Most interestingly, the majority of the main peaks in ΔRi plots 
fall within SRSs (Fig. 4), and the associations between SRSs 
and positive ΔRi values are highly significant for each of the four 
CYP2 subfamilies (Table I). This strong association implies 
the functional importance of high amino acid replacement 
rates within SRSs and supports our initial hypothesis that 
duplicate genes coding for drug-metabolizing P450 enzymes 
with divergent substrate specificities are evolutionarily advan- 
tageous. In accordance with this hypothesis, there have been 
several reports indicating relationships between overall P450 
activities and food habits (Krieger et al., 1971; Ronis and 
Hodgson, 1989). 
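
The statistical test behind Table I is not described in this section. As a
minimal sketch, assuming each alignment position is simply classified as
inside or outside an SRS and as showing a positive or non-positive
substitution-rate difference (the quantity plotted in Fig. 4), a one-sided
Fisher exact test on the resulting 2 x 2 table could be written as below. The
function names and the toy data are hypothetical, and the test actually used
for Table I may differ.

```python
from math import comb


def contingency_table(in_region, positive):
    """Count positions into a 2x2 table (a, b, c, d).

    a: inside region, positive      b: inside region, not positive
    c: outside region, positive     d: outside region, not positive
    """
    pairs = list(zip(in_region, positive))
    a = sum(1 for r, p in pairs if r and p)
    b = sum(1 for r, p in pairs if r and not p)
    c = sum(1 for r, p in pairs if not r and p)
    d = sum(1 for r, p in pairs if not r and not p)
    return a, b, c, d


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the table [[a, b], [c, d]].

    A small value means positive positions are enriched inside the region.
    """
    row1 = a + b          # positions inside the region
    col1 = a + c          # positions with a positive value
    n = a + b + c + d     # all positions
    total = comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(n - row1, col1 - k) / total
    return p


# Toy data (hypothetical, not taken from the paper): six alignment positions.
toy_in_srs = [True, True, True, False, False, False]
toy_positive = [True, True, True, False, False, False]
toy_p = fisher_one_sided(*contingency_table(toy_in_srs, toy_positive))
```

In practice the per-position values would come from the substitution analysis
itself; neighbouring positions are unlikely to be independent, so a test of
this kind should be read as a rough indication only.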

In the case of highly polymorphic human or mouse major 
histocompatibility complex genes, nonsynonymous substitu- 
tion rates within the antigen recognition site were higher than 
synonymous rates, providing strong evidence for adaptive 
overdominant selection (Hughes and Nei, 1988, 1989). As 
listed in Table II, nonsynonymous rates within SRSs of CYP2 
genes are slightly lower than synonymous rates. The average 
ratio of pN/pS = 0.74 ± 0.35 does not by itself indicate whether 
the higher nonsynonymous rates within SRSs, relative to those 
outside SRSs, are due to neutral mutations or adaptive diver- 
sifications. The latter possibility is the more likely from 
circumstantial evidence, but further analyses, preferably with 
a larger set of sequence data, may be required to draw a 
conclusion on the molecular evolutionary mechanism. 
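
To make the comparison of nonsynonymous and synonymous rates concrete, the
following is a minimal sketch of how the two proportions might be estimated
for a pair of aligned coding sequences. It assumes the standard genetic code,
gap-free, in-frame, upper-case DNA, and a site-counting scheme in the spirit
of Nei and Gojobori; as a stated simplification, codon pairs that differ at
more than one position are skipped rather than averaged over mutational
pathways. The function names are hypothetical and this is not necessarily the
procedure used for Table II.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
CODE = {"".join(c): AMINO[i] for i, c in enumerate(product(BASES, repeat=3))}


def syn_sites(codon):
    """Synonymous sites in a codon: for each position, the fraction of the
    three possible base changes that leave the encoded residue unchanged."""
    count = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == CODE[codon]:
                    count += 1 / 3
    return count


def pn_ps(seq1, seq2):
    """Return (pN, pS) for two aligned, gap-free, in-frame coding sequences.

    Simplification (assumption): codon pairs differing at more than one
    position are skipped entirely.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be in frame and of equal length")
    s_sites = n_sites = s_diff = n_diff = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        ndiff = sum(x != y for x, y in zip(c1, c2))
        if ndiff > 1:
            continue
        s = (syn_sites(c1) + syn_sites(c2)) / 2
        s_sites += s
        n_sites += 3 - s
        if ndiff == 1:
            if CODE[c1] == CODE[c2]:
                s_diff += 1
            else:
                n_diff += 1
    if s_sites == 0 or n_sites == 0:
        raise ValueError("no usable sites")
    return n_diff / n_sites, s_diff / s_sites
```

Applying such a function separately to the codons inside and outside the SRSs
would give the two sets of rates compared in the text. A ratio below 1 inside
the SRSs, as reported here, is compatible with either relaxed constraint or
adaptive change, which is why the ratio alone cannot settle the question.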

I also examined nucleotide substitution patterns of three 
other drug-metabolizing P450 families (data not shown). Nu- 
cleotide changes between two members of the CYP1 family, 
1A1 and 1A2, are nearly saturated, giving rise to a featureless 
ΔRi pattern. The substitution pattern of the CYP3A subfamily 
was basically similar to those of CYP2 subfamilies (Fig. 4), 
suggesting that CYP2 and CYP3 have undergone common 
evolutionary processes. On the other hand, the pattern for 
CYP4A was quite different from those of CYP2 or CYP3. 
The functional and/or structural reasons for this remain to 
be elucidated. 

In summary, several independent lines of evidence indicate 
that substrate recognition regions in CYP2 proteins are dis- 
persed along the primary structure. This paper reports six 
such regions, SRS-1-6, which constitute about 16% of the 
total residues in the P450 molecule. It is very likely that 
corresponding regions in other eukaryotic families of P450 
are involved in binding specific substrates, although the six 
SRSs may differ in relative importance in different families 
or subfamilies. Most of the residues that participate in sub- 
strate binding are probably included in these six SRSs, al- 
though a few sites may be missing from the present list. The 
present findings will be useful for molecular design of engi- 
neered P450 enzymes with new substrate specificities. 